NTTMUNSW BioC modules for recognizing and normalizing species and gene/protein mentions

نویسندگان

  • Hong-Jie Dai
  • Onkar Singh
  • Jitendra Jonnagaddala
  • Emily Chia-Yu Su
چکیده

In recent years, the number of published biomedical articles has increased as researchers have focused on biological domains to investigate the functions of biological objects, such as genes and proteins. However, the ambiguous nature of genes and their products have rendered the literature more complex for readers and curators of molecular interaction databases. To address this challenge, a normalization technique that can link variants of biological objects to a single, standardized form was applied. In this work, we developed a species normalization module, which recognizes species names and normalizes them to NCBI Taxonomy IDs. Unlike most previous work, which ignored the prefix of a gene name that represents an abbreviation of the species name to which the gene belongs, the recognition results of our module include the prefixed species. The developed species normalization module achieved an overall F-score of 0.954 on an instance-level species normalization corpus. For gene normalization, two separate modules were respectively employed to recognize gene mentions and normalize those mentions to their Entrez Gene IDs by utilizing a multistage normalization algorithm developed for processing full-text articles. All of the developed modules are BioC-compatible .NET framework libraries and are publicly available from the NuGet gallery.Database URL: https://sites.google.com/site/hjdairesearch/Projects/isn-corpus.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

The BioC-BioGRID corpus: full text articles annotated for curation of protein–protein and genetic interactions

A great deal of information on the molecular genetics and biochemistry of model organisms has been reported in the scientific literature. However, this data is typically described in free text form and is not readily amenable to computational analyses. To this end, the BioGRID database systematically curates the biomedical literature for genetic and protein interaction data. This data is provid...

متن کامل

Species taxonomy for gene normalization

Background: The task of gene normalization is to assign a unique identifier from a database to the gene mentions. Using these identifiers a great deal of information can be gathered from external databases such as interactions, pathways, sequences and protein structures. Normalizing gene mentions in articles is a difficult task as the inter-species ambiguity of the gene mentions in biomedical p...

متن کامل

Species taxonomy for gene name normalization

Background: The task of gene normalization is to assign a unique identifier from a database to the gene mentions. Using these identifiers a great deal of information can be gathered from external databases such as interactions, pathways, sequences and protein structures. Normalizing gene mentions in articles is a difficult task as the inter-species ambiguity of the gene mentions in biomedical p...

متن کامل

Collective Instance-Level Gene Normalization on the IGN Corpus

A high proportion of life science researches are gene-oriented, in which scientists aim to investigate the roles that genes play in biological processes, and their involvement in biological mechanisms. As a result, gene names and their related information turn out to be one of the main objects of interest in biomedical literatures. While the capability of recognizing gene mentions has made sign...

متن کامل

Development of bespoke machine learning and biocuration workflows in a BioC-supporting text mining workbench

As part of our participation in the Collaborative Biocurator Assistant Task of BioCreative V, we developed methods and tools for recognising and normalising mentions denoting genes/proteins and organisms. A combination of different approaches were used in addressing these tasks. The recognition of gene/protein and organism names was cast as a sequence labelling problem to which the conditional ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره 2016  شماره 

صفحات  -

تاریخ انتشار 2016